-
Notifications
You must be signed in to change notification settings - Fork 2.7k
Add optimized UTF-8 validation and transcoding apis, hook them up to UTF8Encoding #21948
Add optimized UTF-8 validation and transcoding apis, hook them up to UTF8Encoding #21948
Conversation
One thing I struggled with here is how to balance readability / safety of code vs. performance. Since validation and transcoding are going to be such common operations, I weighted the decision heavily on the side of highest possible performance. This should be fleshed out by the benchmarks when I rerun them. Many of the more obtuse lines of code exist because they cause the JIT to emit very specific and optimized assembly. I'm hoping that a combination of code comments and test coverage makes the code more maintainable. |
/cc @ahsonkhan and @bartonjs who may find this interesting because it'll be followed by the transcoding and escaping APIs you've been asking for. :) |
src/System.Private.CoreLib/shared/System/Text/Utf8Utility.Validation.cs
Outdated
Show resolved
Hide resolved
src/System.Private.CoreLib/shared/System/Text/Utf8Utility.Validation.cs
Outdated
Show resolved
Hide resolved
src/System.Private.CoreLib/shared/System/Text/Utf8Utility.Validation.cs
Outdated
Show resolved
Hide resolved
src/System.Private.CoreLib/shared/System/Text/Utf8Utility.Validation.cs
Outdated
Show resolved
Hide resolved
It would be nice to have a list of these obtuse lines in this PR. Corelib is becoming rather quickly a collection of such hacks and removing them when the JIT improves might not be so easy. |
- Hook it up through the existing Utf8 public static APIs - Move some shared methods out of ASCIIUtility - Hook it up through the Utf8String ctor
585180f
to
abd7add
Compare
This is still WIP. Todo:
|
- Add vectorized UTF-16 validation and transcoded byte counts - Move Utf16Utility into Unicode namespace alongside Utf8Utility - Fix some bugs in DecoderNLS's draining logic
Removed WIP marker. This is now code-complete. There are some tests failing in corefx since they (incorrectly) assume things about the output of transcoding invalid data. I'll send a separate PR through corefx with those fixes. In the meantime, once I build the full list of failing tests that need to be changed I'll also update the global suppression file in this repo. Benchmarks are pending. High-level summary of changes:
|
@mikedn I've tried to annotate the "hacks" where they appear throughout the code, including with links back to GitHub issues. |
Quick benchmarks, using various corpus texts from Project Gutenberg.
These texts are large (~100KB). I'll work on getting benchmarks for smaller texts as well. |
RyuJIT now handles this optimally
src/System.Private.CoreLib/shared/System/Text/ASCIIUtility.Helpers.cs
Outdated
Show resolved
Hide resolved
src/System.Private.CoreLib/shared/System/Text/ASCIIUtility.Helpers.cs
Outdated
Show resolved
Hide resolved
src/System.Private.CoreLib/shared/System/Text/ASCIIUtility.Helpers.cs
Outdated
Show resolved
Hide resolved
src/System.Private.CoreLib/shared/System/Text/ASCIIUtility.Helpers.cs
Outdated
Show resolved
Hide resolved
Supported commands help: Get descriptions, examples and documentation about supported commands Example: help "command_name" run: Run all pipelines or a specific pipeline for this repository using a comment. Use this command by itself to trigger all related pipelines, or specify a pipeline to run. Example: "run" or "run pipeline_name" See additional documentation. |
/azp run coreclr-ci (Build Linux_musl x64 checked) |
No pipelines are associated with this pull request. |
src/System.Private.CoreLib/shared/System/Text/Unicode/Utf16Utility.Validation.cs
Show resolved
Hide resolved
src/System.Private.CoreLib/shared/System/Text/Unicode/Utf8Utility.Transcoding.cs
Show resolved
Hide resolved
src/System.Private.CoreLib/shared/System/Text/Unicode/Utf8Utility.Transcoding.cs
Outdated
Show resolved
Hide resolved
throw new ArgumentNullException("s"); | ||
// Validate input parameters | ||
|
||
if (chars is null) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm seeing identical codegen for == null
and is null
on string. Why the change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cf. #21948 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gfoidl I'm seeing that both patterns just reduce to br.true or br.false from the compiler (with both string and array on single or multiple arg). So while it may be true on types overloading operator== other than string, I'm challenging @tannergooding's assertion that it matters.
Once it doesn't matter, we're back to what I once saw on a poster: "change is terrible, unless it's awesome". In this case, if there are two ways of identically doing the same thing, the one with fewer lines of diff is better.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is copied + pasted from the ASCIIEncoding
class, which is acting as the source of truth for these implementations and will eventually roll out to UnicodeEncoding
, UTF32Encoding
, etc. It's more lines in this particular diff, but it's overall less diff if you compare ASCIIEncoding.cs and UTF8Encoding.cs against each other.
return GetChars(bytesPtr, bytes.Length, charsPtr, chars.Length, baseDecoder: null); | ||
ThrowHelper.ThrowArgumentOutOfRangeException( | ||
argument: (byteCount < 0) ? ExceptionArgument.byteCount : ExceptionArgument.charCount, | ||
resource: ExceptionResource.ArgumentOutOfRange_NeedNonNegNum); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Feels like you do this enough that you'd have a dedicated demuxer helper.
ThrowHelper.ThrowNeedNonNegNum(
byteCount,
ExceptionArgument.byteCount,
ExceptionArgument.charCount);
...
void ThrowNeedNonNegNum(int firstVal, string firstName, string secondName)
{
ThrowArgumentOutOfRangeException(
firstVal < 0 ? firstName : secondName,
ExceptionResource.ArgumentOutOfRange_NeedNonNegNum);
}
Just to cut down on the size of the boilerplate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree. Amusingly enough, it's written this way because it minimized the diff in the ASCIIEncoding.cs case, and the code got copied to this file so that the diff between ASCIIEncoding.cs and UTF8Encoding.cs would be minimized.
There are some other changes / optimizations I want to make here that should further reduce the overhead of these method calls. Will open a new PR so that I can address them in ASCIIEncoding.cs and here at the same time.
if (baseEncoder != null) | ||
if (((decoder is null) ? this.DecoderFallback : decoder.Fallback) is DecoderReplacementFallback replacementFallback | ||
&& replacementFallback.MaxCharCount == 1 | ||
&& replacementFallback.DefaultString[0] == UnicodeUtility.ReplacementChar) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm guessing tests are in corefx; but ensure that there's a custom replacement fallback of "(ReplacementChar)!" (StartsWith, not a count of 1) and a custom replacement to "."
(right length, wrong char). One to "badger" or some other completely-not-the-thing-we're-looking-for is probably also fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
src/System.Private.CoreLib/shared/System/Text/Unicode/Utf16Utility.Validation.cs
Show resolved
Hide resolved
src/System.Private.CoreLib/shared/System/Text/Unicode/Utf16Utility.Validation.cs
Show resolved
Hide resolved
src/System.Private.CoreLib/shared/System/Text/Unicode/Utf8Utility.Helpers.cs
Show resolved
Hide resolved
Thank you all @gfoidl @ahsonkhan @bartonjs @mikedn @tannergooding for the thoughtful feedback you've all given throughout this process. I know it was a slog. Your assistance is much appreciated! 🎉 |
CI failures are due to #23902, which has already been fixed. Going ahead with this as-is. |
@GrabYourPitchforks could you squash and merge next time? "Clarify comment in 3-byte processing", "Add missing check to 3-byte processing logic", "PR feedback: Fix typos" and similar clutters the git history. |
@MichalStrehovsky I did, including writing an actual real description for the commit, but apparently if GitHub encounters a server-side error during the process it automatically falls back to a normal merge commit. 😒 |
…UTF8Encoding (dotnet/coreclr#21948) * Add optimized UTF-8 validation and transcoding logic - Hook it up through the existing Utf8 public static APIs - Move some shared methods out of ASCIIUtility - Hook it up through the Utf8String ctor * Hook up new UTF-8 logic through UTF8Encoding - Add vectorized UTF-16 validation and transcoded byte counts - Move Utf16Utility into Unicode namespace alongside Utf8Utility - Fix some bugs in DecoderNLS's draining logic * Improve perf of "is ASCII?" inner loop in UTF-8 validation. * Remove SSE41.X64 optimization from AsciiUtility RyuJIT now handles this optimally * Clarify that vector read is unaligned * Simplify vectorized logic; remove unnecessary adjustment * PR feedback: GetElement(0) -> Sse2.StoreLow * PR feedback - Simplify CountNumberOfLeadingAsciiBytesFrom24BitInteger - Extract some consts out to top of file w/ comments * PR feedback: Enable SSE2 in Utf16Utility code * Expand masks in Utf8Utility, fix const in fallback path * Temporarily disable failing CoreFX tests * Fix incorrect Debug.Assert statements * Add comments tracking JIT workarounds. * Rename DWORD -> UInt32 throughout API surface * Re-flow Utf8Utility.Helpers * PR feedback: Fix typos * PR feedback: CountNumberOfLeadingAsciiBytesFrom24BitInteger * PR feedback: Remove redundant endianess checks * PR feedback: Validate nint definitions * PR feedback: Clarify charIsNonAscii vector usage * PR feedback: document tempUtf8CodeUnitCountAdjustment usage * Fix compilation failure in Utf16Utility * PR feedback: Clarify 3-byte sequence processing * Add missing check to 3-byte processing logic * Clarify comment in 3-byte processing Commit migrated from dotnet/coreclr@77a09eb
This is the first batch of the improved UTF-8 validation and transcoding APIs. There's a single workhorse method used for validation and for counting the number of code units that would result from transcoding.
Philosophically, these methods differ from existing methods on
System.Text.Encoding
. The existingEncoding
methods really push consumers to transcode all data to UTF-16 so that they can take advantage of existing UTF-16 data manipulation and inspection APIs. Instead, these new APIs expose information about the underlying structure of the UTF-8 data itself. The intent is that error handling ("I found an invalid subsequence in the stream; what do I do now?") is pushed to a higher layer. That higher layer can be built on top of this API and other inspection APIs in order to provide the desired error handling semantics. This will be clearer when theOperationStatus
-based APIs come online in a future PR.Benchmarks are pending.
Unit tests at dotnet/corefx#34538.
Resolves https://github.com/dotnet/corefx/issues/36163.